Rare variant contribution to cholestatic liver disease in a South Asian population in the United Kingdom

This study assessed the contribution of five genes previously known to be involved in cholestatic liver disease in British Bangladeshi and Pakistani people. Five genes (ABCB4, ABCB11, ATP8B1, NR1H4, TJP2) were interrogated by exome sequencing data of 5236 volunteers. Included were non-synonymous or loss of function (LoF) variants with a minor allele frequency < 5%. Variants were filtered, and annotated to perform rare variant burden analysis, protein structure, and modelling analysis in-silico. Out of 314 non-synonymous variants, 180 fulfilled the inclusion criteria and were mostly heterozygous unless specified. 90 were novel and of those variants, 22 were considered likely pathogenic and 9 pathogenic. We identified variants in volunteers with gallstone disease (n = 31), intrahepatic cholestasis of pregnancy (ICP, n = 16), cholangiocarcinoma and cirrhosis (n = 2). Fourteen novel LoF variants were identified: 7 frameshift, 5 introduction of premature stop codon and 2 splice acceptor variants. The rare variant burden was significantly increased in ABCB11. Protein modelling demonstrated variants that appeared to likely cause significant structural alterations. This study highlights the significant genetic burden contributing to cholestatic liver disease. Novel likely pathogenic and pathogenic variants were identified addressing the underrepresentation of diverse ancestry groups in genomic research.


Results
Genotype to phenotype analysis. Screening of the 5 candidate genes identified 300 variants; and 180 (166 non-synonymous and 14 loss of function (LoF) variants), were included in the analysis (Table 1). Most variants identified were heterozygous unless otherwise specified. A small number of volunteers carried more than one variant, and this is summarised in Supplementary Table 1. None of these volunteers displayed a disease phenotype. Further variant interpretation including scoring details of individual in-silico predictors, the impact of the coding substitution on disease propensity and conservation information for all variants are presented in Supplementary Table 2.
Phenotype to genotype analysis. We were able to validate variants of interest in 15 cases reporting ICP, most of whom had documented raised BA concentrations (Table 2). Further, variants discovered in 10 cases with raised BA but an uncertain or no diagnosis of ICP based on their EHR (Supplementary Table 3). This secondary analysis missed 2 individuals from the primary analysis with a diagnosis of ICP as there were no bile acid concentrations available for them. This analysis demonstrated a pragmatic approach to identifying disease causing variants and demonstrates the value of large genomic cohorts linked to electronic health data records.
Rare variant burden analysis. We observed in cases versus controls a significant enrichment of rare variants in ABCB11 (p = 0.00247). There was no enrichment in ABCB4 (p = 0.39138), ATP8B1 (p = 0.57957), TJP2 (p = 0.17390), or NR1H4 (p = 0.70232). A further Fisher's exact analysis compared percentage of rare variants in ICP cases in Genes & Health compared to Genomics England demonstrating that the rare genetic burden was significantly increased in tier 1 gene candidates in British Pakistani and Bangladeshi (ABCB4, p = 0.0191; ABCB11, p = 0.0191) but not in ATP8B1 (p = 0.4857) NR1H4 (p = 0.2286) or TJP2 (p = 0.1039).

ABCB4 variants.
There was a total of 68 ABCB4 variants identified (Table 1 and Fig. 1). 41 variants fulfilled the inclusion criteria. Variants were identified in people with cholestatic liver disease phenotypes: ICP (n = 5 out of 88 women in the analysis), gallstone disease (n = 7), and cholangiocarcinoma (n = 1) ( Table 3). For some identified variants a known cholestatic phenotype had previously been reported in the literature (n = 9) whilst others had no phenotype reported (n = 19) (Supplementary Table 4). We identified novel LoF variants (n = 4): three frameshift (one associated with an ICP phenotype) and one introduction of a premature stop codon (no associated phenotype) ( Table 4). ABCB11 variants. There were 77 ABCB11 variants identified of which 48 were included in the analysis (Table 1 and Fig. 2). The associated cholestatic liver disease phenotypes identified were: ICP (n = 5 out of 83 women in the analysis), and gallstone disease (n = 10) (Table 3). Some were linked to cholestatic phenotypes previously reported in the literature (n = 14), whilst for 19, no phenotype had been previously reported (n = 19)   Table 5). Likely novel LoF variants were identified (n = 3): a frameshift, a splice-acceptor and an introduction of premature stop codon variant. These variants were not associated with a known disease phenotype (Table 4).   Fig. 3). There were 7 variants associated with gallstone disease; three appeared to be in linkage disequilibrium (LD) noted in three volunteers: M674T, I577V, and H78Q. The R384H variant was associated with gallstone disease and with an ICP phenotype (in separate individuals, n = 2 out of 33 women in the analysis). A further variant was associated with hepatitis-induced liver cirrhosis (I513T), and a final variant (D70N) was associated with liver cirrhosis, secondary malignant neoplasm of liver and bile duct, and gallstone disease (Table 3). In addition, previously reported cholestatic liver disease phenotypes (n = 7) and variants with no previous reported phenotype were seen (n = 15) (Supplementary Table 6). Finally, we identified 4 novel LoF variants: a frameshift/splice region, frameshift, splice-acceptor/coding-sequence, and stop-gain variant. The latter variant was associated with a gallstone disease phenotype (Table 4).

NR1H4 variants.
There were 22 NR1H4 variants in the Genes & Health cohort and 9 variants in the final analysis (Table 1 and Supplementary Fig. 2). We only identified an ICP phenotype (n = 2 out of 33 women in the analysis) (Table 3) and otherwise novel variants that had no previous phenotype reported (n = 7) (Supplementary Table 7). Furthermore, one novel LoF variant was identified without demonstrating a phenotype (Table 4).

TJP2 variants.
There were 83 TJP2 variants identified of which 37 were analysed (Table 1). People with TJP2 variants had ICP (n = 3 out of 103 women in the analysis), gallstone disease (n = 8), previously reported cholestatic liver disease phenotype (n = 4), and 22 did not have a previously reported phenotype (Table 3, Supplementary Table 8). There were two novel LoF variants without a clinical phenotype (Table 4).
Protein structure and molecular modelling. A flow chart illustrating the variants included in this analysis is shown in Supplementary Fig. 3. Results of the protein structure and molecular modelling software tools are presented in Supplementary Table 9. Some novel variants are in regions of transporters for which we can hypothesis a mechanistic impact. Of particular interest are Q1106H in ABCB4 and D191A in ABCB11. These ABC B-family transporters share 48% amino acid identity and are very likely have a common mechanism of action. The two amino acids are conserved www.nature.com/scientificreports/ in both proteins, and we propose that they are involved in energy transduction through the transporter in order to couple the substrate efflux cycle to the ATP binding and catalysis cycle. In ABCB4 and ABCB11, two transmembrane domains (TMDs) bind the transport substrates (phosphatidylcholine (PC) and bile acids), respectively. The conformational changes required for substrate transport are driven by ATP hydrolysis at the interface between two nucleotide binding domains (NBDs). The TMDs and NBDs must therefore be intimately coupled, and this is achieved via four 'coupling helices' (CH) located at the base of the long intracellular loops extending from the transmembrane alpha helices of the TMDs (Supplementary Fig. 4A). Q1106 (ABCB4) and D191 (ABCB11) are particularly interesting because they are located at this interface that is conserved in both ABCB4 and ABCB11. Q1106 is in a groove on the surface of NBD2 where it interacts with CH2 ( Supplementary Fig. 4B).
In the PC-bound conformation of ABCB4, Q1106 forms a weak electrostatic interaction with the peptide bond of G270. In the ATP-bound conformation (from which PC has most likely been released), Q1106 now interacts with Q272 which illustrates the movement of CH2 and its changing juxtaposition with the NBD during the transport cycle; essentially, a hinge region. The geometry of these interactions will not be preserved if Q1106 is replaced by histidine. In ABCB11, this triad is preserved in Q1150 and G295, with E297 providing a conservative change for the glutamine in CH2 (with respect to formation of an equivalent electrostatic bond with Q1150).
In the sole structural model that we have for ABCB11, D191 is in CH1 where it interacts with Y472 in a surface groove of NBD1 and also, intriguingly, with R946 which is in CH4, suggesting that CH1 and CH4 likely work together in energy transduction through the transporter (Supplementary Fig. 4C).
These electrostatic bonds will not be possible if D191 is replaced with alanine. This triad and its bond architecture is perfectly conserved in ABCB4 in the ATP bound conformation through amino acids D166, Y446 and R902. However, in ABCB4 there is also an additional electrostatic interaction between the carbonyl of the D166 peptide bond and the side chain of Q1171. Q1171 (which is conserved in ABCB11) is adjacent to the ABC signature motif 1172 LSGGQ 1176 which is involved in coordination of ATP and provides a direct mechanism for how CH1 influences, and responds to, the ATP catalytic cycle of these transporters.

Discussion
This study identified novel variants implicated in the aetiopathogenesis of cholestatic liver disease that occur uniquely in this British Bangladeshi and Pakistani cohort 36,79-81 . There have not been any other studies of this magnitude analysing the burden of mutational variation in cholestatic liver disease in a large South Asian cohort. Using a genotype to phenotype approach we discovered novel likely pathogenic variants that appear to be unique to this cohort. We then employed a phenotype to genotype analysis using the ICP phenotype as an exemplar, www.nature.com/scientificreports/ which offered a pragmatic interrogation of electronic health records to identify rare genetic variants that are likely pathogenic. Thus, this study improves representation of this distinct population especially as prevalence of cholestatic liver disease is increased in the Genes & Health cohort, e.g. 1.54% are affected by ICP compared to white Europeans (0.62%). This study demonstrates the importance of multi-ancestry genomic research and offers the potential of tailored treatment for this population.
In the Genes & Health cohort, out of 194 variants meeting inclusion criteria we identified 53 that had a cholestatic liver disease phenotype reported in their linked EHR. Of those, 16 are unique to this British Bangladeshi and Pakistani population and a large number were predicted to be likely pathogenic or known pathogenic based on in-silico prediction tools. In addition, there were 35 variants that were previously reported in the literature with a cholestatic phenotype. However, 87 variants had no previously reported phenotype; 67 were novel (34% of all variants analysed in this study) as they were also not previously reported in the GnomAD population database. Despite that, 9 were considered likely pathologic and 5 known pathogenic. It is important to consider that heterozygosity as noted in most cases means that they are likely rescued by the wild-type allele but at higher risk of disease in later life or during times of liver stress, e.g. during pregnancy.
Our findings reflect the difficulty with interpretation of rare variants in clinically important genes when there is no previous evidence in the literature or functional data to interpret them further. The ACMG rare variant interpretation guideline 30 provides a standardised analysis pathway. However, it relies in part on the interpretation of the variant in the context of the literature and does not account for specific genes and diseases. It also may not be robust for flexible membrane proteins which do not work by lock and key mechanism. For example, the ABCB11 variant V444A, considered as benign by the ACMG criteria, has been reported to increase the risk of ICP, hepatitis C disease progression, and drug-induced liver injury although the exact functional mechanisms are not clear yet 55,82 .
By employing computational protein modelling software tools, we were able to identify variants that likely have a significant impact on the conformation of the protein and could therefore be of clinical significance. It is important to bear in mind that all these tools have inherent flaws and are beyond the scope of this paper to discuss in detail. By taking ICP as a cholestatic liver disease example we were able to highlight further difficulties with rare variant interpretation in gestational syndromes as the inherent transient nature of the disease makes variant www.nature.com/scientificreports/ interpretation challenging. However, ICP is a clinically relevant example as the gestational disease consequences are not just relevant to their current pregnancy but also can result in later hepatobiliary disorders such as cancer, immune-mediated and cardiovascular diseases 83 . In addition, they have a higher gallstone-related morbidity and a strong positive association between ICP and hepatitis C exists as well 84 .

Limitations
The use of electronic health records to determine phenotype is extremely useful but dependent upon appropriate information having been coded. Participants with at-risk variants may not have presented yet with symptoms of disease but still be at high risk of developing complications at a later stage in their life, particularly given that the median age of volunteers in this study was around 45 years. It demonstrates the difficulty with interpreting variants when recalling the genotype first.

Conclusions
In this study we provide the first comprehensive evaluation of gene candidates associated with cholestatic liver diseases in a unique cohort of British Bangladeshi and Pakistani origin demonstrating the importance of characterising genetic disease in diverse ethnic groups. Our findings have demonstrated the increased mutational burden of cholestatic liver disease in British Bangladeshi and Pakistani people who thus far remain understudied despite their distinct genetic background and increased risk of developing ICP in comparison to other populations. We were able to identify novel variants that have not been previously identified and are likely implicated in disease. We demonstrated the ability to identify participants at risk both by a phenotype or genotype first approach. This demonstrates the importance of providing more personalised care in a clinical setting as identification of high-risk individuals and their family members enables early intervention to prevent adverse outcomes, for example hepato-protective drugs such as UDCA, in addition to hepatic surveillance. Furthermore, it provides the necessary foundation for improved therapy and drug development.

Study population.
A detailed description of the Genes & Health cohort has been described by Finer et al. 21 .
Ethical approval for the study was provided by the South East London National Research Ethics Committee (14/LO/1240) including consent for publishing http:// www. genes andhe alth. org/ volun teer-infor mation 22 . All Genes & Health volunteers consented to lifelong EHR linkage, DNA extraction and genetic tests. All research was conducted in accordance with NHS Health Research Authority guidelines and regulations. An individual application to support data access for this study was granted by Genes & Health (reference S00037) taking into consideration community prioritisation, acceptability and scientific merit. Exome sequencing samples from 5236 Genes & Health volunteers reporting parental relatedness were available for analysis in variant call format files. For the initial analysis a genotype to phenotype approach was employed interrogating 5 gene candidate loci (Table 5). For the rare burden analysis female volunteers without ICP served as controls (n = 3048). In a secondary analysis, a phenotype to genotype analysis was used to validate these findings, using ICP as the exemplar. For this approach, electronic health records allowed total serum bile acid concentrations ≥ 10 µmol to be retrieved from a network of acute hospitals that provide maternity care to (n = 15,500) women per year living in east London to identify patients with a diagnosis of liver disease in pregnancy (ICD 10 diagnosis code O26.6), see Supplementary Fig. 1. Maternal health records were screened by an experienced clinician to verify a diagnosis of ICP.
Exome sequencing & bioinformatic pipeline. Low/mid exome sequencing was performed as previously described 85 . The exome sequencing data is being held under a data access agreement at the European Genotype-phenome Archive (www. ebi. ac. uk/ ega) under accession numbers EGAD00001005469. Minor allele frequency (MAF) was set at < 5% to include rare and low-frequency genetic variants to allow for a comprehensive evaluation.
Variant annotation. All protein-altering missense, non-sense, frameshift indels or splice site variants identified in the candidate gene set underwent the same processing as described below. Synonymous variants were excluded from further analysis. Variants were filtered and annotated if they met any of the following inclusion criteria (MAF < 5%): 1. associated with a phenotype; 2. known in the literature; 3. no recorded GnomAD allele frequency; 4. predicted to be likely pathogenic (LP) based on all 7 in-silico predictors. To assess the likelihood of functional impact of variants a variety of in-silico tools were employed, including Polyphen 86 , SIFT 87 ,  91 . Open-source databases (Leiden Open Variation Database-LoVD, and ClinVar) including a commercial database (Mastermind) were interrogated to assess whether variants were reported previously in the literature. The American College of Medical Genetics/Association for Molecular Pathology (ACMG/AMP) guidelines for rare variant interpretation was used to assess variant pathogenicity. The guidelines consider a variant to be LP if there is a > 90% certainty it being disease-causing, but below a higher "pathogenic" threshold 92 .

Rare variant burden analysis.
To assess the significance of any rare variant burden of SNPs in all 5 gene candidates the exactCMC function in RVTESTS 93 was used. The burden was calculated as the proportion of all ICP cases versus control in the Genes & Health cohort and who had at least one alternate allele. Variants with an allele frequency of 0.01 or less were included. ICP cases (n = 18) from the Genomics England database (Project ID 747)-a predominantly European genetic cohort-were accessed to serve as a direct comparison to rare variant burden in Genes & Health.
Protein structure and modelling analysis. Inclusion criteria for further protein structure and modelling analysis required variants to have an associated phenotype, and/or be predicted to be known pathogenic based on all 7 in-silico tools. VarMap was used to assess protein sequence variants 94 . 2D representations were designed using the open source tool TOPO2 95 . Variants that were LP or associated with a phenotype underwent further analysis using: (1) Dynamut 96 (2) CUPSAT 97 (3) SNPMuSiC 98 . 3D structural representations were generated using PyMOL software (The PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC.) using ABCB4 with phosphatidylcholine (PDB ID: 7NIV) 99 , ABCB11 open structure (PDB ID: 6LR0) 100 , ATP8B1 (PDB ID: 7PY4) 101 , and NR1H4 (PDB ID: 1OSH) 102 . There is no 3D protein structure available for TJP2.

Data availability
The imputed genotype data from Genes & Heath are available on EGA (www. ega-archi ve. org; study accession number: EGAS00001005373). The individual-level phenotypic data will be made available to researchers by completing an application to Genes & Health, following their open access policy described on https:// www. genes andhe alth. org/ resea rch/ scien tists-using-genes-health-scien tific-resea rch.